TCR Clustering Demo

Load Data

# file paths
cd8_filepath <- here::here("./data/CD8_Stims.rds")

# download
invisible(capture.output(Rdiscvr::DownloadOutputFile(
  outputFileId = 785508,
  outFile = cd8_filepath,
  overwrite = FALSE,
)))


# read
seuratObj_CD8s <- readRDS(cd8_filepath)

# collapse the different gag stims into a single category
seuratObj_CD8s$Stim <- ifelse(
  grepl("CM9|Gag", seuratObj_CD8s$Stim),
  yes = "Gag",
  no = seuratObj_CD8s$Stim
)

Clustering

Omitted: Run TCR Clustering (Probably already done on prime-seq for your data)

Tuning Clustering Parameters

TCR clustering is not a one-size-fits-all procedure. Different datasets, TCR chains, organisms, experimental designs, and biological questions may require different clustering parameters.

The fastest way to determine the suitability of a TCR cluster is to use the tooling in tcrClustR to visualize the clusters in their native distance space.

These heatmaps and histograms are critical to our understanding of what the necessary parameterization is.

The VisualizeTcrDistances function iterates through each of the different clusterings/distances calculated on the TCR data, and shows the distance distributions within and between clusters.

There are two primary differences to consider as you begin to tune the clustering parameters.

  1. Which chain are you clustering on? (TRA, TRB, or both)
  2. Are you using full length distances or just CDR3?

tcrClustR enumerates the different combinations of these two factors and then clusters them, but does so with a single dianaHeight parameter value.

Notice in the heatmaps above that the color scale varies in magnitude. This is because the distance distributions vary between chains and distance types. A dianaHeight height of 50 may be reasonable for TRB full-length distances, too large for TRA_CDR3, and too small for TRA+TRB full length distances.

Depending on your assessment, you can re-run the clustering locally with a different dianaHeight parameter to better suit your data. This doesn’t take long. You may also specify chains, but it is trivial to re-run all chains using a specific height.

Visualization Heatmaps

tcrClustR::VisualizeTcrDistances(
  seuratObj = seuratObj_CD8s,
  showDimplotLegend = FALSE
)

End of Heatmaps

Local Clustering

dianaHeight = 20 (default)

seuratObj_CD8s <- RunTcrClustering(
  seuratObj_TCR = seuratObj_CD8s,
  chainsToCluster = NULL, # if null, all assays will be processed
  clusteringMethod = "DIANA",
  dianaHeight = 20, # this value, in units of TCR distance, will determine roughly how different clones within the same cluster may be.
  clusterSizeThreshold = 2, # removes/filters singleton clusters
  verbose = TRUE
)

Here’s a block that will allow you to visualize the clustering results again after re-clustering with a new parameter using the new height. Most of these are singletons, so they’re given a cluster value 0 here, which is converted to an NA in the Seurat Object’s metadata.

library(ComplexHeatmap)
library(circlize)
# helper function for retaining ggplot colors
gg_color_hue <- function(n) {
  hues <- seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}

tcr_distances <- seuratObj_CD8s@misc$TCR_Distances$TRB_fl$counts
annotation_df <- seuratObj_CD8s@misc$TCR_Distances$TRB_fl@meta.features

cluster_range <- range(annotation_df$TRB_fl_ClusterIdx, na.rm = TRUE)

col_fun <- colorRamp2(
  breaks = seq(cluster_range[1], cluster_range[2], length.out = 10),
  colors = gg_color_hue(10)
)

top_annotation <- ComplexHeatmap::HeatmapAnnotation(
  df = annotation_df %>% dplyr::select(TRB_fl_ClusterIdx),
  col = list(TRB_fl_ClusterIdx = col_fun), # Use the function here
  show_legend = TRUE,
  show_annotation_name = TRUE
)

ComplexHeatmap::Heatmap(
  as.matrix(tcr_distances),
  name = "TCR Distance",
  show_row_names = FALSE,
  show_column_names = FALSE,
  cluster_rows = TRUE,
  cluster_columns = TRUE,
  top_annotation = top_annotation
)

And then by relaxing the clustering threshold:

dianaHeight = 200

Plots generated by this block are suppressed for brevity.

seuratObj_CD8s <- RunTcrClustering(
  seuratObj_TCR = seuratObj_CD8s,
  chainsToCluster = NULL, # if null, all assays will be processed
  clusteringMethod = "DIANA",
  dianaHeight = 200, # this value, in units of TCR distance, will determine roughly how different clones within the same cluster may be.
  clusterSizeThreshold = 2, # removes/filters singleton clusters
  verbose = FALSE
)

For this specific dianaHeight = 200 value, we only get 23 clusters, so we can introspect them visually.

tcr_distances <- seuratObj_CD8s@misc$TCR_Distances$TRB_fl$counts
annotation_df <- seuratObj_CD8s@misc$TCR_Distances$TRB_fl@meta.features
colors <- gg_color_hue(n = length(unique(annotation_df$TRB_fl_ClusterIdx)))
names(colors) <- unique(annotation_df$TRB_fl_ClusterIdx)

top_annotation <-
  ComplexHeatmap::HeatmapAnnotation(
    df = annotation_df %>%
      dplyr::select(TRB_fl_ClusterIdx) %>%
      dplyr::mutate(TRB_fl_ClusterIdx = as.factor(TRB_fl_ClusterIdx)),
    col = list(TRB_fl_ClusterIdx = colors),
    show_legend = TRUE,
    show_annotation_name = TRUE
  )

ComplexHeatmap::Heatmap(
  as.matrix(tcr_distances),
  name = "TCR Distance",
  show_row_names = FALSE,
  show_column_names = FALSE,
  cluster_rows = FALSE,
  cluster_columns = FALSE,
  column_split = annotation_df$TRB_fl_ClusterIdx,
  row_split = annotation_df$TRB_fl_ClusterIdx,
  top_annotation = top_annotation
)

Notice that in cluster 2, there are a few distinct low-distance blocks that are now being grouped together.

This may be acceptable depending the biochemical properties of the TCR itself, but we cannot assess those in this analysis. Thus, you may want to re-run the clustering with a smaller dianaHeight parameter. It’s generally easier to screen more specific clusters with higher confidence that the TCRs are similar, than to try to interpret large, heterogeneous clusters. So, I will re-run the clustering with the default dianaHeight parameter of 20 for this demo.

Visualize Clustering

These data contain PBMCs from subjects (SubjectId) in a variety of stimulation conditions (Stim). We’ll evaluate the distribution of TCR clusters (TRB_fl_ClusterIdx) across subjects and stim conditions.

During in vivo experiments or stimulations with a sort, we would expect the absolute abundance of clones comprising the TCR clusters to change in response to unobserved/assumed changes antigen exposure or response due to proliferation or increases in cellular trafficking. We don’t expect the same of short-term in vitro stimulations such as this dataset, but we can tabulate them anyway.

The gross goal of this plot is for clusters with lots of cells in the experiment to rise to the top of this plot, and for clusters with few cells to sink to the bottom. This will help us quickly identify clusters of interest.

# Visualize CD8 T cell clustering

plot_df <- seuratObj_CD8s@meta.data %>%
  group_by(SubjectId, Stim, DatasetId) %>%
  mutate(total_cells = n()) %>%
  group_by(SubjectId, Stim, TRB_fl_ClusterIdx, DatasetId, total_cells) %>%
  summarize(cluster_cells = n()) %>%
  ungroup()


# cellular abundance
plot_df %>%
  ggplot(aes(
    x = DatasetId, y = log10(cluster_cells + 1),
    label = TRB_fl_ClusterIdx,
    group = interaction(TRB_fl_ClusterIdx, Stim),
    color = SubjectId
  )) +
  geom_point() +
  geom_line(alpha = 0.5) +
  ggrepel::geom_text_repel(max.overlaps = 15) +
  facet_wrap(~Stim, scales = "free_x") +
  egg::theme_article() +
  theme(axis.text.x = element_blank())

# percentages of total cells per dataset
plot_df %>%
  ggplot(aes(
    x = DatasetId, y = cluster_cells / total_cells,
    label = TRB_fl_ClusterIdx,
    group = interaction(TRB_fl_ClusterIdx, Stim),
    color = SubjectId
  )) +
  geom_point() +
  geom_line(alpha = 0.5) +
  ggrepel::geom_text_repel(max.overlaps = 15, show.legend = FALSE) +
  facet_wrap(~Stim, scales = "free_x") +
  egg::theme_article() +
  theme(axis.text.x = element_blank()) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1))

Moreover, one may want to visualize the distribution of clusters in reduced dimensional space, such as a UMAP. For this, it is beneficial to use a whitelist of cluster labels to highlight groups, and compare them to clusters to see if the clonotypes are restricted to specific cellular phenotypes.

seuratObj_CD8s$ClustersOfInterest <- ifelse(
  seuratObj_CD8s$TRB_fl_ClusterIdx %in% c(887, 2042),
  as.character(seuratObj_CD8s$TRB_fl_ClusterIdx),
  "Other"
)

CellMembrane::PlotSeuratVariables(
  xvar = "ClusterNames_0.4",
  yvar = "ClustersOfInterest",
  seuratObj = seuratObj_CD8s,
  label = T
)

Activation Status of Cells within Clusters

One may want to evaluate the activation status of cells in specific clusters of interest. This can be done by comparing expression of activation markers across clusters.

Historically, we’ve done this via thresholding UCell scores on activation signatures, but one could also use classifiers or other methods. Here, I’ll create a variable by thresholding a UCell score.

Define Activated T Cells

seuratObj_CD8s$ActiveTCell <- ifelse(
  seuratObj_CD8s$TandNK_ActivationCore_UCell > 0.5,
  "Activated",
  "NotActivated"
)

Visual Investigation of clusters enriched in activated T cells

The Bimberian Clonotype Plot

The clonotype plot prioritizes showing clone (or in this case, cluster) activation proportionally. The main inferences are to assess the clonality of activated cells, or said more simply: whether or not the activated cells come from a single clone/cluster, or multiple. The clonotype plot allows for one to quickly assess the diversity within the current stimulation condition and across stimulation conditions.

However, It is difficult to interpret the proportions directly due to variability in the size of clones. The activation rate is per dataset, so in datasets with very few activated cells (e.g. 5) the proportion of singletons dominates in a way that is not very obvious to someone unfamiliar with the plot (e.g. few cells [1/5] but a large proportion [20%]). However, for larger clones, the proportion of activated cells is more stable to small variations in the number of responding clones. So, simplicity (e.g. presentations) is a key usage for this plot, but interpretability suffers.

These plots are designed to isolate clones, so having a single cluster set to a negative bin (singletons are set to Cluster 0 internally, then set to NA in the metadata)

for (subjectid in unique(seuratObj_CD8s$SubjectId)) {
  clonotype_plot <- Rdiscvr::MakeClonotypePlot(
    seuratObj = seuratObj_CD8s,
    subjectId = subjectid,
    chain = "TRB_fl_ClusterIdx",
    xFacetField = "Stim",
    activationFieldName = "TandNK_ActivationCore_UCell",
    threshold = 0.5,
  )
  print(clonotype_plot)
}

## [1] "Dropping group field with single value: Tissue"

## [1] "Dropping group field with single value: AssayType"

## [1] "Dropping group field with single value: AssayType"

Scatter Plots

These next plots will greedily sacrifice the simplicity for enhanced interpretability. We’ll scatter the number of activated cells against the total number of cells in the cluster for each stimulation condition.

We’ll investigate two such scatter plots. The first prioritizes reproducibility over biological replicates (different DatasetId, same subject, same stim). The lines connect points from the same cluster across datasets, so elevated and connected lines represent consistent activation across replicates. These are clones we can recover reliably by sampling, and their response rate is reproducible.

We’ll connect the repeated measures (subjects) with lines, so that single dots represent clusters that were only observed in one replicate, and ‘streaks’ of connected dots represent clusters that were observed in multiple replicates. Consistent slopes indicate that the number of responding cells is reproducible across replicates and should give you confidence in the cluster’s response rate.

plot_df <- seuratObj_CD8s@meta.data %>%
  group_by(SubjectId, DatasetId, TRB_fl_ClusterIdx) %>%
  mutate(cluster_cells = n()) %>%
  group_by(SubjectId, Stim, DatasetId, SampleDate, TRB_fl_ClusterIdx, ActiveTCell) %>%
  summarize(activated_cells = n(), cluster_cells = cluster_cells, proportion = activated_cells / cluster_cells) %>%
  # this filter removes non-activated T cells and cells without a cluster assignment (singletons and cells with a TCR called, but missing a TRB).
  filter(ActiveTCell == "Activated" & !is.na(TRB_fl_ClusterIdx) & !(TRB_fl_ClusterIdx == 0)) %>%
  ungroup() %>%
  unique.data.frame()
ggplot(
  plot_df,
  aes(
    x = cluster_cells + 1,
    y = activated_cells + 1,
    label = paste0("ID:", TRB_fl_ClusterIdx, ",\n", round(proportion, 2) * 100, "%"),
    group = interaction(TRB_fl_ClusterIdx, Stim),
    color = SubjectId
  )
) +
  geom_abline(linetype = 2, slope = 1, intercept = 0) +
  ggrepel::geom_text_repel(max.overlaps = 15) +
  geom_point() +
  geom_line(alpha = 0.5) +
  facet_grid(. ~ Stim) +
  egg::theme_article() +
  scale_x_log10() +
  scale_y_log10()

The second plot prioritizes biological replicates (different stims, same subject). Here, elevated and connected lines represent consistent activation across stim conditions, which may indicate antigen-specific or non-specific activation motifs (depending on the stimulation condition)

ggplot(
  plot_df,
  aes(
    x = cluster_cells + 1,
    y = activated_cells + 1,
    label = paste0("ID:", TRB_fl_ClusterIdx, ",\n", round(proportion, 2) * 100, "%"),
    group = interaction(TRB_fl_ClusterIdx, Stim),
    color = Stim
  )
) +
  geom_abline(linetype = 2, slope = 1, intercept = 0) +
  ggrepel::geom_text_repel(max.overlaps = 15, show.legend = FALSE) +
  geom_point() +
  geom_line(alpha = 0.5) +
  facet_grid(. ~ SubjectId) +
  egg::theme_article() +
  scale_x_log10() +
  scale_y_log10() +
  xlab("Total Cells in Cluster + 1 (log10 scale)") +
  ylab("Activated Cells in Cluster + 1 (log10 scale)")

See specifically: this sub-panel (Subject 32726 and Subject 24957) whose cluster 887 and cluster 2042 shows right-skewed (large & activated) populations.

highlight_df <- plot_df %>%
  mutate(Stim = factor(Stim)) %>%
  filter(SubjectId == "32726" | SubjectId == "24957", .preserve = TRUE) %>%
  filter(TRB_fl_ClusterIdx %in% c(887, 2042), .preserve = TRUE)

ggplot(
  highlight_df,
  aes(
    x = cluster_cells + 1,
    y = activated_cells + 1,
    label = paste0("ID:", TRB_fl_ClusterIdx, ",\n", round(proportion, 2) * 100, "%"),
    group = interaction(TRB_fl_ClusterIdx, Stim),
    color = Stim,
    shape = Stim
  )
) +
  geom_abline(linetype = 2, , slope = 1, intercept = 0) +
  ggrepel::geom_text_repel(max.overlaps = 15, show.legend = FALSE) +
  geom_point() +
  geom_line(alpha = 0.5) +
  facet_grid(~SubjectId, scales = "free_x") +
  egg::theme_article() +
  scale_x_log10() +
  scale_y_log10() +
  scale_color_manual(values = gg_color_hue(n = 7)[c(1, 6, 7)])

Cluster 887, while containing tens of activated cells in the VY9 stimulated condition, is also high for cells in the unstimulated condition (purple), suggesting that this cluster contains cytokine producing T cells that are not TCR specific.

Cluster 2042, however, is much more heavily skewed towards the AN10 stimulated condition, compared to it’s VY9 and and NoStim conditions, suggesting that this cluster contains an antigen-specific TCR motif.

Statistics

Part I: Quick, but shameful.

We can perform statistical tests to identify clusters that are significantly enriched for activated T cells compared to the rest of the T cell population. I’m prioritizing exploratory work in this demo, at the expense of publishable methods.

First, I’ll describe the lowest bar to get a p value. In general, do not do this. As long as you aren’t going to report this p value anywhere, I’m fine with internal use for filtering. We’ll use a Fisher’s Exact Test to compare the number of activated vs. non-activated cells in each cluster against all other cells.

When we do statistics, we make assumptions. These are those such assumptions:

  1. I care about the specific activation rate of a cluster (probably always true).
  2. I think that the other clusters’ represent reasonable background rates of activation. (not necessarily true).
  3. Clusters are truly independent of each other (probably never true, due to subject-level V/D/J biasing and dataset-dependent detection rates).

Gross violation of assumption 3 means that the p-values we get are not valid, as some samples/subjects may be more prone to activation than others, have different baseline activation rates, and clusters may share TCR motifs that make them more or less likely to be activated. However, if we use these p-values as a heuristic to identify clusters of interest for further study (i.e. rank them and look at the top 10), this is acceptable.

# p value adjustment method
# you can use 'holm' for greater stringency, but 'BH' is a reasonable default.
p_method <- "BH"

# sidedness for fisher:
# other options are 'less' and 'two.sided'
# I care only about clusters with elevated activation rates, so I'll use 'greater'.
sidedness <- "greater"

stat_results <- plot_df %>%
  group_by(SubjectId, Stim, TRB_fl_ClusterIdx) %>%
  summarize(
    activated_cells = sum(activated_cells),
    total_cells = sum(cluster_cells)
  ) %>%
  ungroup() %>%
  rowwise() %>%
  mutate(non_activated_cells = total_cells - activated_cells) %>%
  mutate(
    other_activated_cells = sum(plot_df$activated_cells) - activated_cells,
    other_non_activated_cells = sum(plot_df$cluster_cells) - total_cells - other_activated_cells
  ) %>%
  rowwise() %>%
  mutate(fisher_p_value = fisher.test(
    matrix(
      c(
        activated_cells,
        non_activated_cells,
        other_activated_cells,
        other_non_activated_cells
      ),
      nrow = 2
    ),
    alternative = sidedness
  )$p.value) %>%
  ungroup() %>%
  # you may change the adjust method here to 'holm', but always adjust your p values.
  mutate(adj_p_value = p.adjust(fisher_p_value, method = "BH"))

stat_results %>%
  arrange(adj_p_value)
## # A tibble: 84 × 10
##    SubjectId Stim   TRB_fl_ClusterIdx activated_cells total_cells
##    <chr>     <chr>              <dbl>           <int>       <int>
##  1 32726     AN10                2042             539        1199
##  2 32069     Gag                 1266             149         375
##  3 32069     Gag                 1096             143         355
##  4 32069     Gag                 1060             140         366
##  5 24957     VY9                  887              50          75
##  6 32069     Gag                 1143             135         371
##  7 32069     Gag                 1098              65         169
##  8 32069     Gag                 1061              72         203
##  9 24957     NoStim               887              47         108
## 10 32069     Gag                  193             132         501
## # ℹ 74 more rows
## # ℹ 5 more variables: non_activated_cells <int>, other_activated_cells <int>,
## #   other_non_activated_cells <int>, fisher_p_value <dbl>, adj_p_value <dbl>

With a p value, we can use a volcano plot to identify clusters with both high activation proportions and significant p values.

Notice:

  • we rediscover the clusters we highlighted before (887and 2042 in subject 32726)
  • we pooled the data across technical replicates (DatasetId) for this analysis, so if a cluster was only observed in one replicate, we should trust it less, but the fisher’s exact test doesn’t know anything about that.
ggplot(
  stat_results,
  aes(
    x = activated_cells / total_cells,
    y = -log10(adj_p_value),
    label = TRB_fl_ClusterIdx,
    color = SubjectId,
    shape = Stim
  )
) +
  geom_point() +
  ggrepel::geom_text_repel(max.overlaps = 15, show.legend = F) +
  egg::theme_article() +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  xlab("Percentage of Activated Cells in Cluster") +
  ylab("-log10 Adjusted P Value")

Part II: Slow, painstaking, and more appropriate.

Logistic regression or mixed-effects models would be a better approach to account for these intra-subject and intra-dataset correlations, but are reliant on knowing the experimental design, structure of the data of your experiment, and the likely biological correlations therein. Exhaustively listing these factors in general is both impossible and beyond the scope of this demo.

For specifically this experiment, I want to compare the AN10 and VY9 stims against their respective NoStim condition. I’ll subset the data to only include these three conditions in a logistic regression to test if the activation rates are different between stims in each cluster. A finalized version of this analysis would include all clusters, and then I would correct their p values. For demonstration purposes, I’ll just test the two clusters we highlighted before (887and 2042).

library(lme4)
subset_df <- seuratObj_CD8s@meta.data %>%
  # define activation state
  mutate(IsActive = ifelse(TandNK_ActivationCore_UCell > 0.5,
    yes = "Activated",
    no = "NotActivated"
  )) %>%
  # binarize for logistic regression
  mutate(IsActive = ifelse(IsActive == "Activated", 1, 0)) %>%
  filter(Stim %in% c("AN10", "VY9", "NoStim")) %>%
  mutate(
    Stim = factor(Stim, levels = c("NoStim", "AN10", "VY9")),
    SubjectId = factor(SubjectId)
  ) %>%
  filter(!is.na(TRB_fl_ClusterIdx)) %>%
  filter(TRB_fl_ClusterIdx == 887 | TRB_fl_ClusterIdx == 2042)

mixed_effects_results <- data.frame()

for (cluster in unique(subset_df$TRB_fl_ClusterIdx)) {
  cluster_data <- subset_df %>%
    # define in-cluster status
    mutate(InCluster = ifelse(TRB_fl_ClusterIdx == cluster, 1, 0)) %>%
    # subset to cluster
    filter(InCluster == 1) %>%
    dplyr::select(IsActive, Stim, SubjectId, DatasetId)

  if (nrow(cluster_data) < 10) {
    print(paste("Skipping cluster", cluster, "due to insufficient data"))
    print(paste0("Only ", nrow(cluster_data), "cells."))
    next
  } else if (length(unique(cluster_data$Stim)) < 2) {
    next
  }

  model <- glmer(
    IsActive ~ Stim +
      (1 | SubjectId) +
      (1 | DatasetId),
    data = cluster_data,
    family = binomial
  )

  summary_model <- summary(model)
  coef_AN10 <- summary_model$coefficients["StimAN10", ]
  coef_VY9 <- summary_model$coefficients["StimVY9", ]
  odds_ratio_AN10 <- exp(coef_AN10[1])
  odds_ratio_VY9 <- exp(coef_VY9[1])
  cat("Cluster:", cluster, "Odds Ratio for AN10 vs NoStim:", odds_ratio_AN10, "\n")
  cat("Cluster:", cluster, "Odds Ratio for VY9 vs NoStim:", odds_ratio_VY9, "\n")
  # extract p values
  p_value_AN10 <- coef(summary_model)[2, 4]
  p_value_VY9 <- coef(summary_model)[3, 4]

  cat("Cluster:", cluster, "P-value for AN10 vs NoStim:", p_value_AN10, "\n")
  cat("Cluster:", cluster, "P-value for VY9 vs NoStim:", p_value_VY9, "\n")
  mixed_effects_results <- rbind(
    mixed_effects_results,
    data.frame(
      Cluster = cluster,
      OddsRatio_AN10 = odds_ratio_AN10,
      PValue_AN10 = p_value_AN10,
      OddsRatio_VY9 = odds_ratio_VY9,
      PValue_VY9 = p_value_VY9
    )
  )
}
## Cluster: 2042 Odds Ratio for AN10 vs NoStim: 133.1103 
## Cluster: 2042 Odds Ratio for VY9 vs NoStim: 5.0569 
## Cluster: 2042 P-value for AN10 vs NoStim: 1.309644e-13 
## Cluster: 2042 P-value for VY9 vs NoStim: 0.02103385
## Cluster: 887 Odds Ratio for AN10 vs NoStim: 4.288035 
## Cluster: 887 Odds Ratio for VY9 vs NoStim: 494.5063 
## Cluster: 887 P-value for AN10 vs NoStim: 0.5557912 
## Cluster: 887 P-value for VY9 vs NoStim: 0.0028803

Comparison

Then we can compare the two statistical tests we performed (subject/ agnostic activation rates with a stim-specific fisher exact test, and stim-specific activation rates with a mixed-effects logistic regression).

Vertical and horizontal lines are a common p value threshold of 0.05.

mixed_effects_results %>%
  pivot_longer(
    cols = c(PValue_VY9, PValue_AN10),
    names_to = "Stim",
    values_to = "PValue"
  ) %>%
  mutate(Stim = ifelse(Stim == "PValue_VY9", "VY9", "AN10")) %>%
  left_join(
    stat_results %>%
      filter(TRB_fl_ClusterIdx %in% mixed_effects_results$Cluster),
    by = c("Cluster" = "TRB_fl_ClusterIdx", "Stim" = "Stim")
  ) %>%
  pivot_longer(
    cols = starts_with("PValue"),
    names_to = "Test",
    values_to = "PValue"
  ) %>%
  ggplot(aes(
    x = -log10(fisher_p_value),
    y = -log10(PValue),
    label = Cluster,
    color = Stim,
    shape = SubjectId
  )) +
  geom_point() +
  geom_vline(xintercept = -log10(0.05), linetype = 2, alpha = 0.3) +
  geom_hline(yintercept = -log10(0.05), linetype = 2, alpha = 0.3) +
  ggrepel::geom_text_repel(max.overlaps = 15, show.legend = FALSE) +
  egg::theme_article() +
  xlab("-log10 Fisher's Exact Test P Value") +
  ylab("-log10 Mixed Effects Logistic Regression P Value")

We can see that in general, the p values from the fisher exact test are much smaller, likely due to the violation of independence assumptions we discussed earlier. The mixed-effects model accounts for intra-subject and intra-dataset correlations, leading to more conservative p values.

Let’s look specifically at cluster 2042:

mixed_effects_results %>%
  pivot_longer(
    cols = c(PValue_VY9, PValue_AN10),
    names_to = "Stim",
    values_to = "PValue"
  ) %>%
  mutate(Stim = ifelse(Stim == "PValue_VY9", "VY9", "AN10")) %>%
  left_join(
    stat_results %>%
      filter(TRB_fl_ClusterIdx %in% mixed_effects_results$Cluster),
    by = c("Cluster" = "TRB_fl_ClusterIdx", "Stim" = "Stim")
  ) %>%
  pivot_longer(
    cols = starts_with("PValue"),
    names_to = "Test",
    values_to = "PValue"
  ) %>%
  filter(Cluster == 2042) %>%
  dplyr::select(Cluster, Stim, Test, PValue, fisher_p_value, SubjectId) %>%
  unique.data.frame() %>%
  ggplot(aes(
    x = -log10(fisher_p_value),
    y = -log10(PValue),
    label = Cluster,
    color = Stim,
    shape = SubjectId
  )) +
  geom_point() +
  geom_vline(xintercept = -log10(0.05), linetype = 2, alpha = 0.3) +
  geom_hline(yintercept = -log10(0.05), linetype = 2, alpha = 0.3) +
  ggrepel::geom_text_repel(max.overlaps = 15, show.legend = FALSE) +
  egg::theme_article() +
  facet_wrap(~Test) +
  xlab("-log10 Fisher's Exact Test P Value") +
  ylab("-log10 Mixed Effects Logistic Regression P Value")

To be more blunt about the ramifications: the number of atoms in the universe is 10^78 to 10^82. If the p value claims that the null hypothesis has a smaller chance of observing the test statistic than randomly plucking a specific atom from the entire universe, that is a sign that the statistical test has been misused.

Next Steps

Now, we have a basic framework for visualizing TCR clusters, evaluating their activation status, and evaluating them.

If you wish to proceed further, consider the following next steps:

  1. Now that you know which clusters contain cells of interest, you may want to perform small adjustments to the clustering parameters to refine your clusters. Write the cell barcodes to a file, increase the dianaHeight parameter slightly, and evaluate the sequences of the new clones against the previously existing ones to see if being more inclusive is a better fit for your data.

  2. Consider downstream methods for evaluating your clonotypes, such as converting them into position weight matrices for motif discovery or structural analyses. Or express in in a cell line and test for antigen specificity. The rest is up to you!

Conclusions

Now, you are empowered to visualize TCR clusters, evaluate their activation status, perform both basic statistical tests and a basic introduction to deal with confounding variables in these analyses, and tune the clustering post-hoc. Remember to visualize your data, consider the assumptions behind your statistical tests, and choose methods that best fit your experimental design and data structure.

Happy clustering!